Feat/Introduce AI semantic search! #980
Will review later, but this will need infra changes before being merged.
…-k to threshold=4.5
Hey guys! I've implemented semantic search on Berkeleytime using BGE embeddings. It already works pretty well, but I’d like to fine-tune it specifically for Berkeley courses.
🎯 I need your help building a small training dataset. Please actually try some searches on Berkeleytime (using the new semantic search), and whenever you see results that look clearly wrong or surprising, send me an example in this format:
Where:
What I’m especially looking for:
Goal: ~50–100 examples total.
* disable sections/lectures, scroll hide bug, event clear bug, calendar month bug, leftborder color bug
* fix: hardcode color for sidebar header
* fix: urgent bug where a class that is not in primarySection cannot be added (access _class before initialization)
* fix: minor format error
---------
Co-authored-by: vaclis.mbp
* hasCatalogItem true by default
* classes datapuller populate terms
* Avoid N+1 enrollment fetches in getCatalog
* clean up
Restores courseId, course.subject, and course.number fields to GET_CANONICAL_CATALOG_QUERY that were removed in d0a37f1.
These fields are essential for cross-listed course functionality:
- courseId: Required by Class.course resolver (class/resolver.ts:112) to fetch the parent course record for cross-listed courses like DATA C100 / STAT C100 which share courseId but have different subjects
- course.subject & course.number: Required by the resolver override (class/resolver.ts:118-119) and grade lookup key generation (catalog/controller.ts:320) to ensure each cross-listed variant displays the correct department name and historical grade distribution
Without these fields:
- Cross-listed courses cannot be identified (no courseId linking)
- Grade distributions show incorrect data (wrong lookup keys)
- Course metadata displays wrong department (no subject override)
Database verification shows courseId '148047' links DATA C100 and STAT C100 across multiple terms, confirming the need for these fields.
Note: termId/sessionId not restored as catalog controller pre-populates enrollment data, making those fields unnecessary for this query.
…isting
The catalog controller was not overriding course.subject and course.number with class-specific values, causing cross-listed courses to be indexed and displayed with incorrect department names.
Issue:
- DATA C100 and STAT C100 share courseId '148047'
- The parent course record has subject: 'STAT', number: 'C100'
- When catalog fetches DATA C100, it was using the parent course's subject
- This caused getIndex() (line 33) to index DATA C100 as 'STAT C100'
- Search results would show wrong department for cross-listed variants
Fix: Added override logic after formatCourse() to set:
- formattedCourse.subject = _class.subject (DATA, not STAT)
- formattedCourse.number = _class.courseNumber (C100)
This matches the behavior in Class.course resolver (class/resolver.ts:118-119) and ensures consistency across all code paths that access course metadata.
Benefits:
- Search indexing uses correct subject for each cross-listed variant
- Course metadata displays correct department in all contexts
- Behavior now consistent between catalog controller and GraphQL resolver
* reassign color to fit timeframe
* dynamically determined time
* rename csv
* only sunrise/sunset, daytime, nighttime
---------
Co-authored-by: maxmwang
… cross-listing"
This reverts commit d87931b.
…log query"
This reverts commit 8ec4795.
* feat: adding enrollment tab into catalog
* fix: minor format
* Make enrollment graph same as the Enrollment page
* linting
* styling improvements
---------
Co-authored-by: PineND
* pill v1
* finished section
* small fixes
* Update apps/frontend/src/components/Class/Sections/Sections.module.scss
  Co-authored-by: Copilot
* copilot fixes
* accessible table
* location support
---------
Co-authored-by: Copilot
Summary
Finally bringing semantic search to reality!
This PR integrates a FAISS-backed semantic search service using BGE (BAAI General Embedding) models (based on Jacky’s work from last semester), along with backend proxy endpoints and updated catalog UX, enabling AI course search.
System Architecture
flowchart LR
  %% ---------- Frontend ----------
  subgraph Frontend
    FE_UI["Search Bar + AI Search Toggle"]
  end
  %% ---------- Node Backend ----------
  subgraph NodeBackend
    ProxyRouter["/api/semantic-search/* (proxy router)"]
    CoursesAPI["/api/semantic-search/courses (lightweight endpoint)"]
    GraphQLResolvers["GraphQL resolvers + hasCatalogData"]
  end
  %% ---------- Python Semantic Service ----------
  subgraph SemanticService["Semantic Search Service (FastAPI)"]
    Health["/health"]
    Refresh["/refresh (rebuild FAISS index)"]
    Search["/search (threshold-based semantic query)"]
    BGE["BGE Embedding Model"]
    FAISS["FAISS Index (cosine similarity)"]
  end
  %% ---------- Catalog Data Puller ----------
  subgraph CatalogData
    DataPuller["GraphQL Catalog Datapuller"]
  end
  %% ---------- Data Flow ----------
  FE_UI -->|Search Query| CoursesAPI
  CoursesAPI -->|Forward to Python| Search
  Search -->|Generate Query Embedding| BGE
  Search -->|Vector Similarity Search| FAISS
  FAISS -->|Threshold-filtered Results| Search
  Search --> CoursesAPI --> FE_UI
  %% Index refresh / data ingestion
  DataPuller --> GraphQLResolvers --> |TODO:|Refresh
  Refresh -->|Fetch Catalog via GraphQL| GraphQLResolvers
  Refresh -->|Generate Embeddings| BGE --> FAISS
Examples
Input: “Memory models in concurrent programming”
→ Should return courses like databases, operating systems, etc.
→ Should not return biology or psychology courses just because of the word “memory.”
Input: “how to shot a hot vlog”

Implementation Details
Python Semantic Search Service (FastAPI)
FastAPI microservice (apps/semantic-search).
Key endpoints:
- /health: readiness probe showing index status
- /refresh: rebuild FAISS index for a given year/semester
- /search: semantic query with threshold filtering
Model Architecture:
- Query text: "Represent this sentence for searching relevant passages: {query}"
- Course text: SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}
Example: manually refreshing an index
curl -X POST http://localhost:8000/refresh \
  -H 'Content-Type: application/json' \
  -d '{"year": 2026, "semester": "Spring"}' | jq
Example: running a semantic search
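As a rough illustration of what a search does end to end, the sketch below applies the query prefix and course text template listed under Model Architecture, then scores every course and keeps those at or above a threshold. It is not the service's code: a toy bag-of-words vector stands in for the BGE embedding, a linear scan stands in for the FAISS index, and the function names, sample courses, and threshold value are all assumptions made for the example. Because the toy embedding is purely lexical, it only shows the data flow, not the semantic matching that BGE provides.

```python
import math
import re
from collections import Counter

# Templates taken from the PR description.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def format_query(query: str) -> str:
    """Build the text that gets embedded for a user query."""
    return f"{QUERY_PREFIX}{query}"


def format_course(subj: str, num: str, title: str, desc: str) -> str:
    """Build the text that gets embedded for one course."""
    return f"SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}"


def toy_embed(text: str) -> dict[str, float]:
    """Stand-in for BGE (assumption): an L2-normalized bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two normalized sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def search(query: str, courses: list[dict], threshold: float) -> list[dict]:
    """Linear scan standing in for FAISS: keep courses scoring >= threshold."""
    q = toy_embed(format_query(query))
    hits = []
    for c in courses:
        score = cosine(q, toy_embed(format_course(**c)))
        if score >= threshold:
            hits.append({"subject": c["subj"], "courseNumber": c["num"], "score": score})
    return sorted(hits, key=lambda h: h["score"], reverse=True)


# Hypothetical sample data, for illustration only.
COURSES = [
    {"subj": "COMPSCI", "num": "162",
     "title": "Operating Systems and System Programming",
     "desc": "Concurrency, memory management, and file systems"},
    {"subj": "INTEGBI", "num": "132",
     "title": "Survey of Human Physiology",
     "desc": "Organ systems of the human body"},
]

if __name__ == "__main__":
    for hit in search("memory models in concurrent programming", COURSES, threshold=0.05):
        print(hit)
```

The hits are shaped as {subject, courseNumber, score} to match what the proxy endpoint is described as returning.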
Backend Integration (Node / Express)
- Added SEMANTIC_SEARCH_URL environment variable pointing to Python service
- Implemented lightweight proxy endpoint /api/semantic-search/courses:
  - {subject, courseNumber, score} for efficient frontend filtering
- Updated GraphQL behavior:
  - hasCatalogData field for term filtering
  - terms(withCatalogData: true)
Frontend (Catalog UI)
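One way the {subject, courseNumber, score} results could drive catalog filtering is sketched below. It is written in Python only to stay consistent with the semantic service; the actual frontend is not Python, and the function name and the shape of the catalog entries are assumptions. The idea is to key the scores on the (subject, courseNumber) pair, keep only catalog entries that appear in the results, and order them by score.

```python
def filter_catalog(catalog: list[dict], results: list[dict]) -> list[dict]:
    """Keep catalog entries present in the search results, best score first.

    `catalog` entries are assumed (hypothetically) to carry `subject` and
    `courseNumber` keys, matching the fields returned by the proxy endpoint.
    """
    scores = {(r["subject"], r["courseNumber"]): r["score"] for r in results}
    matched = [c for c in catalog if (c["subject"], c["courseNumber"]) in scores]
    return sorted(
        matched,
        key=lambda c: scores[(c["subject"], c["courseNumber"])],
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical data: only the DATA variant of a cross-listed course matched.
    catalog = [
        {"subject": "DATA", "courseNumber": "C100"},
        {"subject": "STAT", "courseNumber": "C100"},
        {"subject": "COMPSCI", "courseNumber": "162"},
    ]
    results = [
        {"subject": "COMPSCI", "courseNumber": "162", "score": 0.4},
        {"subject": "DATA", "courseNumber": "C100", "score": 0.7},
    ]
    print(filter_catalog(catalog, results))
```

Keying on the subject and number pair, rather than on a shared course id, keeps cross-listed variants such as DATA C100 and STAT C100 distinct in the filtered list.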
Technical Decisions
Why BGE over other models?
Why threshold instead of top-k?
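To make the difference between the two strategies concrete, here is a small sketch with made-up scores. The labels, score values, k, and threshold are all illustrative assumptions, not numbers from this PR. Top-k always returns k items, even when some of them score poorly; a threshold returns however many items clear the bar, which may be fewer, or none.

```python
Scored = tuple[str, float]  # (course label, similarity score)


def top_k(scored: list[Scored], k: int) -> list[Scored]:
    """Return the k highest-scoring items, regardless of how low they score."""
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]


def above_threshold(scored: list[Scored], threshold: float) -> list[Scored]:
    """Return every item scoring at least `threshold`, best first."""
    kept = [p for p in scored if p[1] >= threshold]
    return sorted(kept, key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    # Illustrative scores only: two strong matches and two weak ones.
    scored = [("weak-1", 0.31), ("strong-1", 0.71), ("weak-2", 0.22), ("strong-2", 0.64)]
    print(top_k(scored, 3))              # includes one weak match to fill k=3
    print(above_threshold(scored, 0.5))  # only the two strong matches
```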
Model Options Available (hardcoded in main.py)
Next Steps
Datapuller Integration: TOP PRIORITY!
Automatically trigger /refresh endpoint when new catalog data is pulled
Fine-tuning for Berkeley Courses
Collect user feedback dataset (query + relevant/irrelevant courses) to fine-tune BGE specifically for Berkeley course search
Query Expansion
Handle abbreviations (NLP → Natural Language Processing) and synonyms
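One possible starting point for abbreviation handling, sketched under assumptions: a small lookup table applied to the query before it is embedded. This is not part of the PR. The function name is hypothetical, and only the NLP entry comes from the text above; the other entry is an illustrative placeholder. Synonyms would need a different mechanism and are not covered here.

```python
import re

# Only "nlp" comes from the PR text; "ml" is an illustrative placeholder.
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "ml": "machine learning",
}


def expand_query(query: str, table: dict[str, str] = ABBREVIATIONS) -> str:
    """Append the long form after each known abbreviation, keeping the original."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        full = table.get(word.lower())
        return f"{word} ({full})" if full else word

    # Match whole alphabetic words so "ml" inside "html" is left alone.
    return re.sub(r"[A-Za-z]+", replace, query)


if __name__ == "__main__":
    print(expand_query("intro to NLP"))
```

Keeping the original abbreviation next to its expansion means course descriptions that use either form can still match.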
Based on: Initial prototype by Jacky (last semester)
Frontend integration: @PineND